Final Project - Heart Disease Prediction¶
Team Members (Group-6):¶
- Layanika Vinay Saravanan - 8934459
- Priyanka Chitikela - 8909667
Problem Statement¶
Coronary Heart Disease (CHD) remains a leading cause of morbidity and mortality worldwide, posing challenges to public health systems and individual well-being. Early identification of individuals at high risk is crucial for preventing severe outcomes through timely interventions like lifestyle modifications, medical treatments, and resource prioritization. However, the complex interplay of multiple risk factors, including age, blood pressure, cholesterol levels, diabetes, and lifestyle behaviors, complicates this process.
This project aims to develop a machine learning-based predictive model to classify individuals into CHD risk categories using demographic, biometric, and clinical data. By identifying key predictors, addressing class imbalance, and providing actionable insights, the project seeks to enhance early detection, reduce healthcare costs, and improve patient outcomes.
About Dataset:¶
The dataset contains 3,674 entries with 16 features, focusing on predicting coronary heart disease (CHD) risk. Key demographic attributes include age, sex, and education. Lifestyle factors, such as smoking status and cigarettes per day, are included alongside medical history like diabetes, hypertension, and stroke prevalence. Biometric measurements include total cholesterol, systolic and diastolic blood pressure, BMI, heart rate, and glucose levels. The target variable, CHDRisk, is binary (yes/no) but imbalanced, with more "no" cases. Some columns have missing values, which require preprocessing. The dataset combines categorical and numerical data, making it suitable for machine learning classification models.
Importing Necessary Libraries¶
Import libraries for data manipulation, visualization, and machine learning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
Loading and Exploring the Dataset¶
Load the dataset and preview the first few rows
data = pd.read_csv(r'C:\Users\Layanika.V.S\Desktop\Fall24\FML24\Final Project\Heart_Disease (1).csv')
print(data.head())
      sex  age  education smokingStatus  cigsPerDay  BPMeds  prevalentStroke  \
0    male   39          4            no           0       0                0
1  female   46          2            no           0       0                0
2    male   48          1           yes          20       0                0
3  female   61          3           yes          30       0                0
4  female   46          3           yes          23       0                0

   prevalentHyp diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  \
0             0       no      195  106.0   70.0  26.97         80       77
1             0       no      250  121.0   81.0  28.73         95       76
2             0       no      245  127.5   80.0  25.34         75       70
3             1       no      225  150.0   95.0  28.58         65      103
4             0       no      285  130.0   84.0  23.10         85       85

  CHDRisk
0      no
1      no
2      no
3     yes
4      no
Check for missing values, then display basic information about the dataset to understand its structure
print(data.isnull().sum())
sex                11
age                 0
education           0
smokingStatus      13
cigsPerDay          0
BPMeds              0
prevalentStroke     0
prevalentHyp        0
diabetes            0
totChol             0
sysBP               0
diaBP               0
BMI                 0
heartRate           0
glucose             0
CHDRisk             0
dtype: int64
print(data.info())
print(data.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3674 entries, 0 to 3673
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 3663 non-null object
1 age 3674 non-null int64
2 education 3674 non-null int64
3 smokingStatus 3661 non-null object
4 cigsPerDay 3674 non-null int64
5 BPMeds 3674 non-null int64
6 prevalentStroke 3674 non-null int64
7 prevalentHyp 3674 non-null int64
8 diabetes 3674 non-null object
9 totChol 3674 non-null int64
10 sysBP 3674 non-null float64
11 diaBP 3674 non-null float64
12 BMI 3674 non-null float64
13 heartRate 3674 non-null int64
14 glucose 3674 non-null int64
15 CHDRisk 3674 non-null object
dtypes: float64(3), int64(9), object(4)
memory usage: 459.4+ KB
None
age education cigsPerDay BPMeds prevalentStroke \
count 3674.000000 3674.000000 3674.000000 3674.000000 3674.000000
mean 49.577300 1.984213 9.092270 0.030212 0.005716
std 8.546068 1.022891 11.938399 0.171194 0.075397
min 32.000000 1.000000 0.000000 0.000000 0.000000
25% 42.000000 1.000000 0.000000 0.000000 0.000000
50% 49.000000 2.000000 0.000000 0.000000 0.000000
75% 56.000000 3.000000 20.000000 0.000000 0.000000
max 70.000000 4.000000 70.000000 1.000000 1.000000
prevalentHyp totChol sysBP diaBP BMI \
count 3674.000000 3674.000000 3674.00000 3674.000000 3674.000000
mean 0.310016 236.761840 132.38024 82.906505 25.783038
std 0.462563 44.039295 22.04683 11.948024 4.056048
min 0.000000 113.000000 83.50000 48.000000 15.540000
25% 0.000000 206.000000 117.00000 75.000000 23.080000
50% 0.000000 234.000000 128.00000 82.000000 25.400000
75% 1.000000 263.000000 143.50000 89.500000 27.990000
max 1.000000 600.000000 295.00000 142.500000 56.800000
heartRate glucose
count 3674.000000 3674.000000
mean 75.719652 81.769461
std 11.957171 23.884454
min 44.000000 40.000000
25% 68.000000 71.000000
50% 75.000000 78.000000
75% 82.000000 87.000000
max 143.000000 394.000000
EDA - Exploratory Data Analysis:¶
import matplotlib.pyplot as plt
import seaborn as sns
# Count plot for target variable
sns.countplot(x='CHDRisk', data=data)
plt.title('Heart Disease Classification')
plt.show()
Interpretation of the Distribution of CHD Risk Graph:
- The graph shows the count of individuals in the two categories of CHDRisk:
- "no" (later encoded as 0): no risk of coronary heart disease (CHD).
- "yes" (later encoded as 1): at risk of coronary heart disease (CHD).
- The majority of individuals (over 3,000) are in the "no" category, indicating no risk of CHD.
- A smaller group (under 700) falls into the "yes" category, representing individuals at risk.
- This imbalance can bias machine learning models toward predicting the majority class.
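The imbalance can also be quantified numerically with `value_counts`; a minimal sketch on a toy series whose counts mirror the split shown in the plot:

```python
import pandas as pd

# Toy target column mimicking the roughly 84/16 "no"/"yes" split seen above
chd = pd.Series(['no'] * 840 + ['yes'] * 160, name='CHDRisk')

counts = chd.value_counts()
ratio = counts['no'] / counts['yes']  # majority-to-minority ratio
print(counts)
print(f"Imbalance ratio (no:yes): {ratio:.2f}:1")
```

A ratio above roughly 4:1 is usually enough to bias accuracy-driven classifiers toward the majority class.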
sns.pairplot(data, hue='CHDRisk')
plt.show()
A pairplot provides a matrix of scatterplots and histograms to visualize pairwise relationships and distributions of numerical features. In this pairplot, the features are compared, and the target variable CHDRisk is used for color coding (CHDRisk = 0 for blue and CHDRisk = 1 for orange).
Key Observations from the Pairplot:
Distributions:
- Features like age, sysBP, and BMI show clear variability and differences between CHDRisk = 0 and CHDRisk = 1.
- Features such as BPMeds and prevalentStroke are heavily skewed toward 0, indicating most individuals do not have these conditions.
Relationships:
- sysBP and diaBP: Strong positive correlation, reflecting typical blood pressure dynamics.
- age and CHDRisk: Older individuals are more likely to have CHDRisk = 1.
- BMI and CHDRisk: Higher BMI is associated with a greater risk of CHD.
- glucose and CHDRisk: Elevated glucose levels are more common among those with CHDRisk = 1.
CHDRisk Patterns:
- Orange points (CHDRisk = 1) are more dispersed in sysBP, diaBP, and age, indicating these features may strongly predict CHD risk.
- Features like education show no significant distinction between the classes.
Implications for Modeling:
- Key Features: sysBP, age, BMI, and glucose are strong predictors and should be prioritized.
- Weak Features: Variables like education and BPMeds may have limited predictive value.
- Non-linear models like Random Forests or Gradient Boosting may better handle overlaps in feature distributions.
# Filter numeric columns
numeric_data = data.select_dtypes(include=['float64', 'int64'])
# Compute the correlation matrix
correlation_matrix = numeric_data.corr()
# Visualize the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Matrix Heatmap")
plt.show()
Key Observations from the Correlation Matrix Heatmap
Strong Positive Correlations:
- sysBP and diaBP (~0.79): Systolic and diastolic blood pressure are highly correlated, reflecting their physiological relationship.
- prevalentHyp and sysBP (~0.70): A history of hypertension is strongly associated with higher systolic blood pressure.
Moderate Positive Correlations:
- age and sysBP (~0.39): Older individuals tend to have higher systolic blood pressure.
- age and prevalentHyp (~0.30): Older age is linked with a higher prevalence of hypertension.
- totChol and age (~0.27): Cholesterol levels tend to increase slightly with age.
Weak or No Correlation:
- Features like education, cigsPerDay, and prevalentStroke have weak correlations with most other variables, indicating limited direct influence.
Implications for Modeling:
- Strongly correlated features, such as sysBP and diaBP, may provide redundant information. Dimensionality reduction or careful selection is required to avoid multicollinearity.
- Features like age, sysBP, BMI, and prevalentHyp may serve as important predictors due to their moderate to strong relationships with key health indicators.
Data Preprocessing¶
Handling missing values, encoding categorical data, and scaling numerical features
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values per column:\n", missing_values)
# Fill missing values with the median for numerical columns
data.fillna(data.median(numeric_only=True), inplace=True)
# Optionally, drop rows with missing values
# data.dropna(inplace=True)
Missing values per column:
sex                11
age                 0
education           0
smokingStatus      13
cigsPerDay          0
BPMeds              0
prevalentStroke     0
prevalentHyp        0
diabetes            0
totChol             0
sysBP               0
diaBP               0
BMI                 0
heartRate           0
glucose             0
CHDRisk             0
dtype: int64
# drop rows with missing values
data.dropna(inplace=True)
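The two-step strategy above (median-fill numeric gaps, then drop rows still missing categorical values) can be traced on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric gap and one categorical gap
df = pd.DataFrame({
    'glucose': [77.0, np.nan, 85.0, 103.0],
    'sex':     ['male', 'female', None, 'female'],
})

# Step 1: median-fill numeric columns only (median of 77, 85, 103 is 85)
df.fillna(df.median(numeric_only=True), inplace=True)

# Step 2: drop rows still missing categorical values
df.dropna(inplace=True)
print(df)
```

Only the row with the missing `sex` value is lost; the numeric gap is imputed rather than dropped.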
Defining the target variable
# Define the target variable
y = data['CHDRisk']  # Binary target label ("yes"/"no")
# Drop the target column from the dataset to create features
X = data.drop(columns=['CHDRisk'])
# Display the first few rows of features and target
print("Features (X):\n", X.head())
print("Target (y):\n", y.head())
Features (X):
sex age education smokingStatus cigsPerDay BPMeds prevalentStroke \
0 male 39 4 no 0 0 0
1 female 46 2 no 0 0 0
2 male 48 1 yes 20 0 0
3 female 61 3 yes 30 0 0
4 female 46 3 yes 23 0 0
prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose
0 0 no 195 106.0 70.0 26.97 80 77
1 0 no 250 121.0 81.0 28.73 95 76
2 0 no 245 127.5 80.0 25.34 75 70
3 1 no 225 150.0 95.0 28.58 65 103
4 0 no 285 130.0 84.0 23.10 85 85
Target (y):
0 no
1 no
2 no
3 yes
4 no
Name: CHDRisk, dtype: object
Encoding using Dummies
X_encoded = pd.get_dummies(X, drop_first=True)
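`drop_first=True` keeps one indicator per categorical column (e.g. `sex_male` rather than both `sex_male` and `sex_female`), avoiding perfectly collinear dummies; a minimal sketch:

```python
import pandas as pd

# Toy frame: one binary categorical column plus a numeric column
toy = pd.DataFrame({'sex': ['male', 'female', 'male'],
                    'age': [39, 46, 48]})

# drop_first=True keeps a single indicator (sex_male) instead of two collinear dummies
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Numeric columns pass through untouched; only object/categorical columns are expanded.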
Scaling Features
Features were scaled using StandardScaler to standardize them with a mean of 0 and a standard deviation of 1, ensuring all features are on the same scale.
from sklearn.preprocessing import StandardScaler
# Initialize a StandardScaler
scaler = StandardScaler()
# Fit and transform the numerical features
X_scaled = scaler.fit_transform(X_encoded)
# Convert scaled features back to a DataFrame
X_scaled_df = pd.DataFrame(X_scaled, columns=X_encoded.columns)
print("Scaled Features (X):\n", X_scaled_df.head())
Scaled Features (X):
age education cigsPerDay BPMeds prevalentStroke prevalentHyp \
0 -1.236597 1.973967 -0.762182 -0.176227 -0.07605 -0.671516
1 -0.417496 0.016881 -0.762182 -0.176227 -0.07605 -0.671516
2 -0.183468 -0.961662 0.914163 -0.176227 -0.07605 -0.671516
3 1.337719 0.995424 1.752336 -0.176227 -0.07605 1.489168
4 -0.417496 0.995424 1.165615 -0.176227 -0.07605 -0.671516
totChol sysBP diaBP BMI heartRate glucose sex_male \
0 -0.947563 -1.197277 -1.081416 0.292359 0.355471 -0.199268 1.116245
1 0.300388 -0.516495 -0.159361 0.726858 1.608944 -0.241382 -0.895861
2 0.186938 -0.221490 -0.243184 -0.110046 -0.062353 -0.494064 1.116245
3 -0.266862 0.799682 1.014165 0.689826 -0.898002 0.895690 -0.895861
4 1.094539 -0.108027 0.092109 -0.663045 0.773295 0.137642 -0.895861
smokingStatus_yes diabetes_yes
0 -0.983165 -0.165183
1 -0.983165 -0.165183
2 1.017124 -0.165183
3 1.017124 -0.165183
4 1.017124 -0.165183
Key Observations:
- Negative values indicate below-average feature values, while positive values indicate above-average values.
- Binary features like sex_male and smokingStatus_yes are standardized but retain near-binary patterns.
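Note that the scaler above was fit on the full dataset before the train/test split, which leaks test-set statistics into training. A common alternative, sketched here on synthetic data (not the project's dataset), is to fit the scaler on the training split only via a `Pipeline`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the heart-disease dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fit on X_tr only; at predict time the same training
# statistics are applied to X_te, so no test-set information leaks in
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
print(f"Test accuracy: {score:.2f}")
```

For standardization the leakage effect is usually small, but the pipeline pattern generalizes to any preprocessing step.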
# Check shapes of features and target
print(f"Shape of Features (X): {X_scaled_df.shape}")
print(f"Shape of Target (y): {y.shape}")
Shape of Features (X): (3652, 15)
Shape of Target (y): (3652,)
Splitting Data for Model Training¶
Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.2, random_state=42, stratify=y)
# Check the shapes of the splits
print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Target Shape:", y_train.shape)
print("Testing Target Shape:", y_test.shape)
Training Features Shape: (2921, 15)
Testing Features Shape: (731, 15)
Training Target Shape: (2921,)
Testing Target Shape: (731,)
Model Training¶
As a baseline, a Linear Regression model is fit to predict CHDRisk (converted to numeric values: 1 for "yes" and 0 for "no").
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
# Convert the 'yes'/'no' target labels to numeric values (1 and 0)
y_train = y_train.replace({'yes': 1, 'no': 0})
y_test = y_test.replace({'yes': 1, 'no': 0})
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate R², MSE, MAE
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# Print the results
print("R² Score:", r2)
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
R² Score: 0.11224613831981356
Mean Squared Error (MSE): 0.11685514310150058
Mean Absolute Error (MAE): 0.2398565933421994
Evaluation Metrics:
- R² Score: 0.11
- Indicates that the model explains only 11% of the variance in the target variable, suggesting a poor fit for the data.
- Mean Squared Error (MSE): 0.117
- Represents the average squared difference between predicted and actual values. Lower values are better.
- Mean Absolute Error (MAE): 0.24
- Indicates the average absolute difference between predictions and actual values.
Insights:
- The low R² score shows that Linear Regression is not suitable for this problem, likely due to the binary nature of the target variable and its non-linear relationships with features.
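To make the mismatch concrete: a linear model outputs continuous scores, so using it as a classifier requires thresholding its predictions, which is essentially what logistic regression does in a principled way. A minimal sketch with hypothetical prediction values (note the scores can even fall outside [0, 1]):

```python
import numpy as np

# Hypothetical continuous predictions from a linear model on a binary target
y_pred_continuous = np.array([0.12, 0.48, 0.51, 0.87, -0.05, 1.10])

# Threshold at 0.5 to obtain class labels
y_pred_class = (y_pred_continuous >= 0.5).astype(int)
print(y_pred_class)  # [0 0 1 1 0 1]
```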
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
# Initialize the Random Forest model
model = RandomForestClassifier(random_state=42)
# Train the model on the training data
model.fit(X_train, y_train)
print("Model training completed!")
Model training completed!
Logistic Regression
from sklearn.linear_model import LogisticRegression
# Initialize Logistic Regression
logreg_model = LogisticRegression(random_state=42, max_iter=1000)
# Train the model
logreg_model.fit(X_train, y_train)
print("Logistic Regression training completed!")
Logistic Regression training completed!
SVM Model
from sklearn.svm import SVC
# Initialize SVM
svm_model = SVC(random_state=42, kernel='linear', probability=True)
# Train the model
svm_model.fit(X_train, y_train)
print("SVM training completed!")
SVM training completed!
KNN Model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, auc
import matplotlib.pyplot as plt
# Initialize KNN model
knn_model = KNeighborsClassifier(n_neighbors=5) # You can experiment with n_neighbors
# Train the model
knn_model.fit(X_train, y_train)
# Predict on the test set
y_pred_knn = knn_model.predict(X_test)
print("KNN training completed")
KNN training completed
Model Performance Evaluation¶
Random Forest Classifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1) # Use 1 for the positive class
recall = recall_score(y_test, y_pred, pos_label=1) # Use 1 for the positive class
f1 = f1_score(y_test, y_pred, pos_label=1) # Use 1 for the positive class
# Print metrics
print("Random Forest Model Performance:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
# Display detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Random Forest Model Performance:
Accuracy: 0.85
Precision: 0.67
Recall: 0.09
F1 Score: 0.16
Classification Report:
precision recall f1-score support
0 0.85 0.99 0.92 617
1 0.67 0.09 0.16 114
accuracy 0.85 731
macro avg 0.76 0.54 0.54 731
weighted avg 0.83 0.85 0.80 731
Interpretation
- The model achieved an accuracy of 85%, indicating good performance on the test set. However, accuracy alone can be misleading due to class imbalance.
- The model performs well in predicting the majority class (CHDRisk = 0), with precision of 85% and recall of 99%.
- For the minority class (CHDRisk = 1), the performance drops significantly, with a recall of only 9% and an F1 score of 0.16, highlighting the model's struggle to identify individuals at risk.
- The macro average (unweighted mean of metrics across classes) is low for recall (0.54) and F1 score (0.54), reflecting poor balance in performance between classes.
- The weighted average (accounting for class imbalance) shows better results due to the dominance of the majority class.
Conclusion
- The model is biased towards the majority class (CHDRisk = 0) due to class imbalance.
- While the overall accuracy is high, the recall for the minority class (CHDRisk = 1) is too low for the model to be considered effective for identifying at-risk individuals.
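One low-effort mitigation for this bias, not tried above, is `class_weight='balanced'`, which reweights training samples inversely to class frequency. A sketch on synthetic imbalanced data (the effect on the actual dataset would need to be verified):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 85/15 imbalanced binary problem, roughly mirroring the CHDRisk split
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

# class_weight='balanced' upweights minority-class samples during tree building
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X_tr, y_tr)
minority_recall = recall_score(y_te, rf.predict(X_te))
print(f"Minority-class recall: {minority_recall:.2f}")
```

Class weighting typically trades some precision for recall on the minority class, which is usually the right trade for screening applications.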
Logistic Regression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.linear_model import LogisticRegression
# Re-fit the Logistic Regression model on the training data
logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train, y_train)
# Predict on the test set
y_pred_logreg = logreg_model.predict(X_test)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred_logreg)
precision = precision_score(y_test, y_pred_logreg, pos_label=1) # Use 1 as the positive label
recall = recall_score(y_test, y_pred_logreg, pos_label=1) # Use 1 as the positive label
f1 = f1_score(y_test, y_pred_logreg, pos_label=1) # Use 1 as the positive label
# Print metrics
print("Logistic Regression Model Performance:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
# Display detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred_logreg))
Logistic Regression Model Performance:
Accuracy: 0.86
Precision: 0.77
Recall: 0.15
F1 Score: 0.25
Classification Report:
precision recall f1-score support
0 0.86 0.99 0.92 617
1 0.77 0.15 0.25 114
accuracy 0.86 731
macro avg 0.82 0.57 0.59 731
weighted avg 0.85 0.86 0.82 731
Interpretation
- The model achieved an accuracy of 86%, slightly better than the Random Forest model. However, accuracy alone is insufficient due to class imbalance.
- High precision (86%) and recall (99%) indicate that the model predicts the majority class effectively.
- The recall for the minority class is 15%, meaning the model misses most individuals at risk.
- The F1 score of 0.25 reflects the model's limited effectiveness in balancing precision and recall for the minority class.
- Macro average recall (0.57) shows imbalance in performance between classes.
- Weighted average metrics are higher due to dominance of the majority class.
Conclusion
- While Logistic Regression performs better in precision and F1 score for the minority class compared to Random Forest, it still struggles with recall.
- The model is biased towards the majority class, making it less effective for identifying high-risk individuals.
SVM Model
import warnings
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.svm import SVC
# Suppress all warnings
warnings.filterwarnings('ignore')
# Re-fit the SVM model; note this uses the default RBF kernel,
# replacing the linear-kernel svm_model trained earlier
svm_model = SVC()
svm_model.fit(X_train, y_train)
# Predict on the test set
y_pred_svm = svm_model.predict(X_test)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred_svm)
precision = precision_score(y_test, y_pred_svm, pos_label=1)  # 1 is the positive ("yes") class
recall = recall_score(y_test, y_pred_svm, pos_label=1)
f1 = f1_score(y_test, y_pred_svm, pos_label=1)
# Print metrics
print("SVM Model Performance:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
# Display detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))
SVM Model Performance:
Accuracy: 0.84
Precision: 0.00
Recall: 0.00
F1 Score: 0.00
Classification Report:
precision recall f1-score support
0 0.84 1.00 0.92 617
1 0.00 0.00 0.00 114
accuracy 0.84 731
macro avg 0.42 0.50 0.46 731
weighted avg 0.71 0.84 0.77 731
Interpretation
- The model achieved an accuracy of 84%, which seems high but is misleading due to its failure to predict the minority class (CHDRisk = 1).
- The model performs well for the majority class with precision (84%), recall (100%), and an F1 score of 0.92.
- The model entirely fails to predict the minority class (CHDRisk = 1), with precision, recall, and F1 score of 0.00.
- This suggests the model is biased towards the majority class and disregards the minority class entirely.
- Macro average metrics (precision: 0.42, recall: 0.50) reflect the severe imbalance in performance across classes.
- Weighted averages are skewed by the majority class, showing slightly better performance due to the large size of CHDRisk = 0.
Conclusion
- The SVM model is ineffective for identifying individuals at risk of CHD (CHDRisk = 1).
- It highlights the need for addressing class imbalance and possibly tuning SVM parameters (e.g., using a different kernel or class weighting).
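The class weighting suggested above can be sketched by comparing a default SVC with a class-weighted one on synthetic imbalanced data (illustrative only; results on the project's data may differ):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15],
                           class_sep=0.8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

# class_weight='balanced' scales the penalty C inversely to class frequency,
# pushing the decision boundary toward the majority class
svm_plain = SVC().fit(X_tr, y_tr)
svm_weighted = SVC(class_weight='balanced').fit(X_tr, y_tr)

recall_plain = recall_score(y_te, svm_plain.predict(X_te))
recall_weighted = recall_score(y_te, svm_weighted.predict(X_te))
print(f"Minority recall, default:  {recall_plain:.2f}")
print(f"Minority recall, weighted: {recall_weighted:.2f}")
```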
KNN
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_knn)
precision = precision_score(y_test, y_pred_knn, pos_label=1)
recall = recall_score(y_test, y_pred_knn, pos_label=1)
f1 = f1_score(y_test, y_pred_knn, pos_label=1)
# Print metrics
print("KNN Model Performance:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
# Display classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred_knn))
KNN Model Performance:
Accuracy: 0.83
Precision: 0.38
Recall: 0.11
F1 Score: 0.16
Classification Report:
precision recall f1-score support
0 0.85 0.97 0.91 617
1 0.38 0.11 0.16 114
accuracy 0.83 731
macro avg 0.61 0.54 0.54 731
weighted avg 0.78 0.83 0.79 731
Interpretation
- The model achieved an accuracy of 83%, which is reasonable but heavily influenced by the majority class (CHDRisk = 0).
- The model performs well for the majority class with precision (85%), recall (97%), and an F1 score of 0.91.
- The model struggles with the minority class, achieving only 11% recall and an F1 score of 0.16. This indicates that the model misses most instances of CHDRisk = 1.
- The macro average recall (0.54) and F1 score (0.54) reveal a poor balance between the classes.
- The weighted averages are higher due to the dominance of the majority class.
Conclusion
- The KNN model is biased towards the majority class and underperforms in identifying individuals at risk (CHDRisk = 1).
- The low precision and recall for the minority class limit its usefulness in real-world applications.
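The `n_neighbors=5` used above was not tuned; a quick way to choose k is cross-validated grid search. A sketch on synthetic data (odd k values avoid voting ties in a binary problem):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=42)

# Search odd k values; score with F1 rather than accuracy so the
# minority class is not ignored
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
                    cv=5, scoring='f1')
grid.fit(X, y)
best_k = grid.best_params_['n_neighbors']
print(f"Best k: {best_k}, cross-validated F1: {grid.best_score_:.2f}")
```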
Model Interpretation and Insights¶
# Get feature importances
importances = model.feature_importances_
# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
# Display the top features
print("Feature Importance:\n", feature_importance_df)
Feature Importance:
Feature Importance
7 sysBP 0.137490
9 BMI 0.127470
0 age 0.125110
11 glucose 0.122854
6 totChol 0.121238
8 diaBP 0.117746
10 heartRate 0.092414
2 cigsPerDay 0.052908
1 education 0.035582
12 sex_male 0.021318
5 prevalentHyp 0.018797
13 smokingStatus_yes 0.013011
3 BPMeds 0.006419
14 diabetes_yes 0.005124
4 prevalentStroke 0.002519
Conclusion
- Focus on Key Features: Features like sysBP, BMI, age, glucose, and totChol should be prioritized in model building and interpretation.
- Potential Redundancy: The strong importance of both sysBP and diaBP may indicate overlap; dimensionality reduction could be considered.
- Insights for Interventions: The importance of modifiable factors (e.g., BMI, glucose, cholesterol) highlights areas for preventive healthcare efforts.
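Impurity-based importances like those above can overstate continuous, high-cardinality features; permutation importance on held-out data is a common cross-check. A sketch on synthetic data using `permutation_importance` from `sklearn.inspection`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the score drop;
# a large drop means the model genuinely relies on that feature
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean.round(3))
```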
Feature Importance for Random Forest
import matplotlib.pyplot as plt
# Plot feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance for Random Forest')
plt.gca().invert_yaxis()
plt.show()
Interpretation
- sysBP (Systolic Blood Pressure) is the most important feature, highlighting its critical role in predicting CHDRisk.
- BMI (Body Mass Index), age, glucose, and totChol (Total Cholesterol) also show high importance, emphasizing the significance of these health metrics.
- diaBP (Diastolic Blood Pressure) and heartRate contribute moderately, reflecting their influence on cardiovascular health.
- Features like cigsPerDay, education, and prevalentHyp (Hypertension History) have lower importance but may still provide some predictive value.
- BPMeds (Blood Pressure Medications), diabetes_yes, and prevalentStroke contribute minimally, indicating less relevance in this dataset.
Confusion Matrix for SVM Model
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Make predictions on the test set with the fitted SVM model
y_pred = svm_model.predict(X_test)
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix using seaborn heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.title("Confusion Matrix for SVM Model")
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
Interpretation
True Positives (Class 1 Predicted as Class 1):
- Count: 0
- The SVM model failed to correctly predict any instances of the minority class (CHDRisk = 1).
True Negatives (Class 0 Predicted as Class 0):
- Count: 617
- The model correctly classified all instances of the majority class (CHDRisk = 0).
False Positives (Class 0 Predicted as Class 1):
- Count: 0
- No false positives, indicating no instances of Class 0 were incorrectly classified as Class 1.
False Negatives (Class 1 Predicted as Class 0):
- Count: 114
- All instances of the minority class (CHDRisk = 1) were misclassified as Class 0.
Insights:
- The SVM model is heavily biased towards the majority class (CHDRisk = 0) and fails to predict any instances of the minority class (CHDRisk = 1).
- This imbalance reflects poor recall and F1 score for the minority class, rendering the model ineffective for identifying individuals at risk of CHD.
ROC Curve for KNN
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import numpy as np
# Get the predicted probabilities for the positive class (class 1) from the trained KNN model
y_pred_proba = knn_model.predict_proba(X_test)[:, 1]
# Ensure no NaN values in y_test by creating a mask for non-NaN values
mask = y_test.notna() # Create a mask for non-NaN values
y_test_clean = y_test[mask] # Cleaned y_test
y_pred_proba_clean = y_pred_proba[mask] # Cleaned predicted probabilities
# Debugging: Print the lengths before and after cleaning
print(f"Length of original y_test: {len(y_test)}")
print(f"Length of y_pred_proba: {len(y_pred_proba)}")
print(f"Length of cleaned y_test: {len(y_test_clean)}")
print(f"Length of cleaned y_pred_proba: {len(y_pred_proba_clean)}")
# Debugging: Check the unique values in y_test
print(f"Unique values in y_test: {y_test_clean.unique()}")
# The mask above already removed NaNs and kept y_test_clean and
# y_pred_proba_clean aligned, so no further dropping or truncation is needed
# Debugging: Check after dropping NaN values
print(f"Length of y_test_clean after dropping NaN: {len(y_test_clean)}")
# Check if y_test_clean and y_pred_proba_clean have the same length and are non-empty
if len(y_test_clean) == 0 or len(y_pred_proba_clean) == 0:
raise ValueError("y_test_clean or y_pred_proba_clean is empty. Check data cleaning process.")
# Check for NaN values in y_pred_proba_clean using numpy
if np.isnan(y_pred_proba_clean).sum() > 0:
raise ValueError("y_pred_proba_clean contains NaN values. Please clean the data.")
# Check if y_pred_proba_clean contains probabilities, not binary class labels
if not np.all((y_pred_proba_clean >= 0) & (y_pred_proba_clean <= 1)):
raise ValueError("y_pred_proba_clean should contain probabilities, not class labels.")
# Calculate the ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test_clean, y_pred_proba_clean, pos_label=1)
roc_auc = auc(fpr, tpr)
# Plotting the ROC Curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'KNN (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', lw=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for KNN')
plt.legend(loc="lower right")
plt.show()
Length of original y_test: 731
Length of y_pred_proba: 731
Length of cleaned y_test: 731
Length of cleaned y_pred_proba: 731
Unique values in y_test: [0 1]
Length of y_test_clean after dropping NaN: 731
Interpretation
- The Area Under the Curve (AUC) is 0.68, indicating that the KNN model has modest discriminatory power but struggles to effectively distinguish between CHDRisk = 0 and CHDRisk = 1.
- The ROC curve is only slightly above the diagonal (random classifier line), suggesting that the model performs marginally better than random guessing.
- The True Positive Rate (TPR) increases with the False Positive Rate (FPR), but the curve's shape indicates limited ability to identify positive cases (CHDRisk = 1).
- The low AUC and the flat curve segments reflect the challenge posed by the imbalance in the dataset, where CHDRisk = 1 is the minority class.
ROC Curve for Logistic Regression
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Get the predicted probabilities for the positive class (class 1)
y_pred_proba = logreg_model.predict_proba(X_test)[:, 1] # Probabilities for class '1'
# Compute the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
# Plotting the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'Logistic Regression (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', lw=2) # Diagonal line for random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression')
plt.legend(loc="lower right")
plt.show()
Interpretation
- The Area Under the Curve (AUC) is 0.74, indicating moderate discriminatory power of the logistic regression model in distinguishing between CHDRisk = 0 and CHDRisk = 1.
- The ROC curve is well above the diagonal line, showing the model performs better than random guessing but leaves room for improvement, especially for identifying minority class instances (CHDRisk = 1).
- The curve highlights that the model balances between False Positive Rate (FPR) and True Positive Rate (TPR), achieving reasonable sensitivity for positive cases.
- While logistic regression provides a straightforward and interpretable approach, its AUC of 0.74 suggests the need for enhancements to improve performance, particularly for imbalanced data.
Future Improvements:¶
Integration of More Complex Risk Factors: Incorporate additional data sources such as genetic information, family history, and environmental factors to build a more comprehensive model for predicting CHD risk.
Real-time Risk Assessment: Develop a real-time prediction system that can be integrated into clinical workflows or mobile applications, providing healthcare professionals with immediate risk assessments for better decision-making.
Personalized Risk Profiles: Extend the model to generate personalized health risk profiles for individuals based on their lifestyle choices and health conditions, offering tailored prevention strategies.
Collaboration with Healthcare Systems: Partner with hospitals, clinics, or public health organizations to deploy the predictive model in real-world settings, validate its effectiveness, and gather feedback for further improvement.
Dynamic Risk Assessment: Develop a dynamic model that can update risk scores over time as new data becomes available, enabling healthcare systems to track changes in patients' health status and adjust preventive measures accordingly.
Conclusion:¶
This project aimed to predict the risk of Coronary Heart Disease (CHD) using a machine learning approach, leveraging demographic, biometric, and clinical data. Multiple models, including Random Forest, Logistic Regression, Support Vector Machine (SVM), and k-Nearest Neighbors (KNN), were evaluated to determine their effectiveness.
Logistic Regression emerged as the best-performing model, achieving an Area Under the Curve (AUC) of 0.74 and an accuracy of 86%. However, all models struggled with imbalanced data, particularly in identifying high-risk individuals (CHDRisk = 1), as evidenced by low recall and F1 scores for the minority class. Feature importance analysis highlighted systolic blood pressure (sysBP), BMI, age, glucose, and cholesterol as the most significant predictors of CHD.
Despite moderate success, the models revealed limitations in handling class imbalance and non-linear relationships. Future improvements could include advanced techniques like oversampling, under-sampling, or synthetic data generation (e.g., SMOTE) to balance the dataset and the use of ensemble methods or deep learning for better performance.
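Of the rebalancing techniques mentioned, simple random oversampling requires no extra dependency (SMOTE would come from the separate `imbalanced-learn` package). A sketch on a toy frame standing in for the training split:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for the training split
train = pd.DataFrame({'sysBP': range(100), 'CHDRisk': [0] * 85 + [1] * 15})

majority = train[train['CHDRisk'] == 0]
minority = train[train['CHDRisk'] == 1]

# Sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=42)
balanced = pd.concat([majority, minority_up])
balanced_counts = balanced['CHDRisk'].value_counts()
print(balanced_counts)
```

Oversampling must be applied to the training split only; resampling before the train/test split would leak duplicated minority rows into the test set.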
Overall, the study provides valuable insights into CHD risk factors, underscores the importance of early detection, and highlights the need for tailored interventions to improve health outcomes. Further refinements to the model could enhance its applicability in real-world healthcare settings.